Vaccine data, hospital case data, and wastewater
signal data have been loaded, cleaned, and merged into one final
data frame, final_data.
The variables have been log transformed to improve the correlation between them.
Initially, without the log transformation, the wastewater signal and vaccination in the 5-year-old age group showed the highest correlation with hospitalization, as can be seen in the correlation plot below. The log transformation is used here to improve the correlation with the other vaccination variables.
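As a minimal sketch of why a log transformation can raise a Pearson correlation, the example below uses made-up data in which the response follows a power law in the predictor. The vectors `x` and `y` are hypothetical and are not columns of final_data; the natural log also requires strictly positive values.

```r
# Hypothetical data: y follows a power law in x with multiplicative noise.
set.seed(42)
x <- seq(1, 100, length.out = 200)
y <- 5 * x^(-1.5) * exp(rnorm(200, sd = 0.2))

# Pearson correlation on the raw scale and on the log scale.
raw_cor <- cor(x, y)
log_cor <- cor(log(x), log(y))

print(round(c(raw = raw_cor, log = log_cor), 2))
```

On the log scale the power law becomes a straight line, so the linear correlation is stronger in magnitude. Whether the same happens for real data depends on the shape of the relationship.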
The correlation between lagged values of the wastewater signal and current hospital cases was checked for the past 100 lags. The correlation is highest for the first few lags and then decreases. The lagged correlations are higher for the log-transformed data than for the original data. The higher correlations also persist up to the 10th lag, whereas in the original data they persisted only until the 7th lag before starting to decrease.
## [1] "Correlation for lag of2"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.58
## [1] "Correlation for lag of3"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.57
## [1] "Correlation for lag of4"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.56
## [1] "Correlation for lag of5"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.56
## [1] "Correlation for lag of6"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.55
## [1] "Correlation for lag of7"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.53
## [1] "Correlation for lag of8"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.52
## [1] "Correlation for lag of9"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.51
## [1] "Correlation for lag of10"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.5
## [1] "Correlation for lag of11"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.47
## [1] "Correlation for lag of12"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.45
## [1] "Correlation for lag of13"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.43
## [1] "Correlation for lag of14"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.42
## [1] "Correlation for lag of15"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.4
## [1] "Correlation for lag of16"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.38
## [1] "Correlation for lag of17"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.35
## [1] "Correlation for lag of18"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.33
## [1] "Correlation for lag of19"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.31
## [1] "Correlation for lag of20"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.28
## [1] "Correlation for lag of21"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.26
## [1] "Correlation for lag of22"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.24
## [1] "Correlation for lag of23"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.23
## [1] "Correlation for lag of24"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.2
## [1] "Correlation for lag of25"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.18
## [1] "Correlation for lag of26"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.17
## [1] "Correlation for lag of27"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.16
## [1] "Correlation for lag of28"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.15
## [1] "Correlation for lag of29"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.13
## [1] "Correlation for lag of30"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.12
## [1] "Correlation for lag of31"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.11
## [1] "Correlation for lag of32"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.09
## [1] "Correlation for lag of33"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.07
## [1] "Correlation for lag of34"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.05
## [1] "Correlation for lag of35"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.03
## [1] "Correlation for lag of36"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.02
## [1] "Correlation for lag of37"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.01
## [1] "Correlation for lag of38"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.01
## [1] "Correlation for lag of39"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0
## [1] "Correlation for lag of40"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.01
## [1] "Correlation for lag of41"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of42"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of43"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.03
## [1] "Correlation for lag of44"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.03
## [1] "Correlation for lag of45"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.03
## [1] "Correlation for lag of46"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.03
## [1] "Correlation for lag of47"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of48"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of49"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.05
## [1] "Correlation for lag of50"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.05
## [1] "Correlation for lag of51"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.06
## [1] "Correlation for lag of52"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.06
## [1] "Correlation for lag of53"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.06
## [1] "Correlation for lag of54"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.06
## [1] "Correlation for lag of55"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.05
## [1] "Correlation for lag of56"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of57"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.03
## [1] "Correlation for lag of58"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of59"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of60"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of61"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.03
## [1] "Correlation for lag of62"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.05
## [1] "Correlation for lag of63"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of64"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of65"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of66"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of67"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.04
## [1] "Correlation for lag of68"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of69"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.01
## [1] "Correlation for lag of70"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0
## [1] "Correlation for lag of71"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0
## [1] "Correlation for lag of72"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0
## [1] "Correlation for lag of73"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.01
## [1] "Correlation for lag of74"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of75"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.01
## [1] "Correlation for lag of76"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of77"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of78"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of79"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.02
## [1] "Correlation for lag of80"
## N1_N2_avg
## observed_census_ICU_p_acute_care -0.01
## [1] "Correlation for lag of81"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0
## [1] "Correlation for lag of82"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.01
## [1] "Correlation for lag of83"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.01
## [1] "Correlation for lag of84"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.02
## [1] "Correlation for lag of85"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.03
## [1] "Correlation for lag of86"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.03
## [1] "Correlation for lag of87"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.03
## [1] "Correlation for lag of88"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.04
## [1] "Correlation for lag of89"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.04
## [1] "Correlation for lag of90"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.04
## [1] "Correlation for lag of91"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.05
## [1] "Correlation for lag of92"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.05
## [1] "Correlation for lag of93"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.06
## [1] "Correlation for lag of94"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.07
## [1] "Correlation for lag of95"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.07
## [1] "Correlation for lag of96"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.09
## [1] "Correlation for lag of97"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.11
## [1] "Correlation for lag of98"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.12
## [1] "Correlation for lag of99"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.13
## [1] "Correlation for lag of100"
## N1_N2_avg
## observed_census_ICU_p_acute_care 0.13
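A rough sketch of how lagged correlations like those listed above can be computed in base R. The series `signal` and `cases`, the helper `lag_cor`, and the built-in 5-day lead are all hypothetical; this is not the code that produced the output above.

```r
# Hypothetical daily series: `signal` leads `cases` by 5 days.
set.seed(1)
n <- 300
signal <- as.numeric(arima.sim(list(ar = 0.9), n = n))
cases <- c(rep(NA, 5), signal[1:(n - 5)]) + rnorm(n, sd = 0.5)

# Correlation between the signal k days earlier and the cases today.
lag_cor <- function(x, y, k) {
  idx <- seq_len(length(x) - k)
  cor(x[idx], y[idx + k], use = "complete.obs")
}

lags <- 0:20
cors <- sapply(lags, function(k) lag_cor(signal, cases, k))
best_lag <- lags[which.max(cors)]

print(round(cors, 2))
print(best_lag)
```

Collecting the correlations in one vector, rather than printing one matrix per lag, also makes it easy to plot correlation against lag.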
Correlation between waste water signal and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_5_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_12_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_18_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_30_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_40_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_50_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_60_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_70_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_80_1_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_5_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_12_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_18_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_30_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_40_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_50_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_60_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_70_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
Correlation between Percent_of_Ottawa_residents_80_2_dose and observed_census_ICU_p_acute_care (hospitalizations).
The scatter plots do not show a perfectly linear relationship between hospitalization and the other variables, especially the vaccine data. The relationship looks non-linear, so a piecewise linear model might be a better approach. To compare predictive accuracy, both a Simple Linear Regression model and a MARS model are fitted below.
Simple Linear Regression is performed for predictive analysis. The data is divided into a training set and a test set; the model is fit on the training data and validated on the test data using performance metrics. The training set contains 97% of the data.
## [1] "Percent_of_Ottawa_residents_5_1_dose"
## [2] "Percent_of_Ottawa_residents_12_1_dose"
## [3] "Percent_of_Ottawa_residents_18_1_dose"
## [4] "Percent_of_Ottawa_residents_30_1_dose"
## [5] "Percent_of_Ottawa_residents_40_1_dose"
## [6] "Percent_of_Ottawa_residents_50_1_dose"
## [7] "Percent_of_Ottawa_residents_60_1_dose"
## [8] "Percent_of_Ottawa_residents_70_1_dose"
## [9] "Percent_of_Ottawa_residents_80_1_dose"
## [10] "Percent_of_Ottawa_residents_5_2_dose"
## [11] "Percent_of_Ottawa_residents_12_2_dose"
## [12] "Percent_of_Ottawa_residents_18_2_dose"
## [13] "Percent_of_Ottawa_residents_30_2_dose"
## [14] "Percent_of_Ottawa_residents_40_2_dose"
## [15] "Percent_of_Ottawa_residents_50_2_dose"
## [16] "Percent_of_Ottawa_residents_60_2_dose"
## [17] "Percent_of_Ottawa_residents_70_2_dose"
## [18] "Percent_of_Ottawa_residents_80_2_dose"
## [19] "observed_census_ICU_p_acute_care"
## [20] "N1_N2_avg"
A summary of the Linear Regression model is presented below, with the beta coefficients of the variables used to predict hospitalization and their statistical significance. The R-squared is around 91% (0.913, adjusted 0.909). Therefore, about 91% of the variance in the response is explained by the simple linear regression model.
##
## Call:
## lm(formula = observed_census_ICU_p_acute_care ~ ., data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.22683 -0.18036 0.00988 0.17625 0.89719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.913738 0.087454 33.317 < 2e-16 ***
## Percent_of_Ottawa_residents_5_1_dose -0.133564 0.052526 -2.543 0.011394 *
## Percent_of_Ottawa_residents_12_1_dose 0.177007 0.117090 1.512 0.131439
## Percent_of_Ottawa_residents_18_1_dose -0.083284 0.243523 -0.342 0.732542
## Percent_of_Ottawa_residents_30_1_dose 0.324004 0.300811 1.077 0.282120
## Percent_of_Ottawa_residents_40_1_dose -0.633988 0.317505 -1.997 0.046564 *
## Percent_of_Ottawa_residents_50_1_dose -0.425634 0.207306 -2.053 0.040742 *
## Percent_of_Ottawa_residents_60_1_dose 0.651911 0.236713 2.754 0.006171 **
## Percent_of_Ottawa_residents_70_1_dose 0.413762 0.113419 3.648 0.000301 ***
## Percent_of_Ottawa_residents_80_1_dose 0.497717 0.094756 5.253 2.51e-07 ***
## Percent_of_Ottawa_residents_5_2_dose 0.677179 0.037940 17.848 < 2e-16 ***
## Percent_of_Ottawa_residents_12_2_dose -1.277322 0.110150 -11.596 < 2e-16 ***
## Percent_of_Ottawa_residents_18_2_dose 0.114402 0.005534 20.674 < 2e-16 ***
## Percent_of_Ottawa_residents_30_2_dose -0.358340 0.247580 -1.447 0.148620
## Percent_of_Ottawa_residents_40_2_dose -0.221773 0.412349 -0.538 0.591011
## Percent_of_Ottawa_residents_50_2_dose -0.600223 0.229280 -2.618 0.009203 **
## Percent_of_Ottawa_residents_60_2_dose -0.131543 0.374880 -0.351 0.725862
## Percent_of_Ottawa_residents_70_2_dose -0.065496 0.218728 -0.299 0.764767
## Percent_of_Ottawa_residents_80_2_dose -0.320780 0.140120 -2.289 0.022610 *
## N1_N2_avg 0.080078 0.133299 0.601 0.548375
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3451 on 379 degrees of freedom
## Multiple R-squared: 0.9131, Adjusted R-squared: 0.9088
## F-statistic: 209.6 on 19 and 379 DF, p-value: < 2.2e-16
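A minimal sketch of a 97% / 3% split followed by a linear fit on all predictors, in the same `response ~ .` form as the call above. The data frame `dat` and its columns are hypothetical, and a random split is an assumption; the report does not state how the rows were assigned.

```r
# Hypothetical data frame with one response and two predictors.
set.seed(123)
n <- 400
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 0.5 * dat$x2 + rnorm(n, sd = 0.3)

# Random 97% / 3% split into training and test rows.
train_idx <- sample(seq_len(n), size = round(0.97 * n))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# `y ~ .` regresses y on every other column of the training data.
fit <- lm(y ~ ., data = train)
r2 <- summary(fit)$r.squared
adj_r2 <- summary(fit)$adj.r.squared

print(c(r2 = r2, adj_r2 = adj_r2))
```

With only 3% of the rows held out, test metrics rest on very few observations, so they can vary noticeably from one split to another.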
Checking residuals in the Simple Linear Regression model and creating a plot of the residuals.
## res
## 329 -0.28781067
## 313 0.05275158
## 95 0.59747430
## 209 -0.37878383
## 351 -0.20156495
## 317 -0.19855165
The errors were not normally distributed when the regression analysis was done on the data without log transformation. After log transformation, however, the errors in the linear regression model look normally distributed. The Q-Q plot, histogram, and residual plots reflect this.
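A small sketch of residual normality checks in base R, on a hypothetical fit (`x`, `y`, and `fit` are made up for illustration). Plotting is switched off here so that the numeric summaries can be inspected; setting `plot.it = TRUE` and `plot = TRUE` draws the usual figures.

```r
# Hypothetical linear fit with normally distributed noise.
set.seed(7)
x <- runif(200, 1, 10)
y <- 2 + 0.8 * x + rnorm(200, sd = 0.5)
fit <- lm(y ~ x)
res <- residuals(fit)

# Q-Q data: theoretical normal quantiles (x) against sample quantiles (y).
qq <- qqnorm(res, plot.it = FALSE)
qq_cor <- cor(qq$x, qq$y)  # close to 1 when residuals are near normal

# Histogram bin counts without drawing the figure.
h <- hist(res, plot = FALSE)

# Shapiro-Wilk test: a large p-value gives no evidence against normality.
sw <- shapiro.test(res)

print(qq_cor)
print(sw$p.value)
```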
Calculating metrics from actual vs predicted:
## predicted actual
## 121 84.30219 93
## 193 32.09095 39
## 205 16.96734 21
## 213 25.70477 21
## 219 22.34769 26
## 230 25.20965 28
Root Mean Squared Error for the Simple Linear Regression Model on the test set:
## [1] 29.44475
MAPE on the test set:
## [1] 0.249383
Root Mean Squared Error for the Simple Linear Regression Model on the train set:
## [1] 14.4038
MAPE on the train set:
## [1] 0.2606919
Standard deviation of the actual data
## [1] 35.92902
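The two metrics can be sketched as follows, using only the six predicted/actual pairs printed above. The helper names `rmse` and `mape` are illustrative; because only six rows are used, the results differ from the full test-set values reported above.

```r
# The six predicted/actual pairs shown above (original scale).
predicted <- c(84.30219, 32.09095, 16.96734, 25.70477, 22.34769, 25.20965)
actual <- c(93, 39, 21, 21, 26, 28)

# Root mean squared error: square root of the mean squared difference.
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Mean absolute percentage error, as a fraction (0.25 means 25%).
mape <- function(actual, predicted) mean(abs((actual - predicted) / actual))

rmse_six <- rmse(actual, predicted)
mape_six <- mape(actual, predicted)

print(c(rmse = rmse_six, mape = mape_six))
```

Comparing RMSE against the standard deviation of the actual data gives a sense of scale: an RMSE well below that standard deviation means the model does better than always predicting the mean.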
Plots comparing actual test data and predicted test data by simple linear regression model
The MARS (Multivariate Adaptive Regression Splines) model uses piecewise linear regression to fit the data. It automatically finds the knots in the data to use for the piecewise regression.
The model has been fit on the training data, which consists of 97% of the data.
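The piecewise linear idea can be sketched with a hinge function, h(z) = max(0, z), which is the building block of the MARS terms. In this illustration the data and the knot at 4 are hypothetical and the knot is fixed by hand, whereas MARS searches for knots automatically.

```r
# Hinge function: zero below the knot, linear above it.
h <- function(z) pmax(0, z)

# Hypothetical data with a change in slope at x = 4.
set.seed(10)
x <- seq(0, 10, length.out = 200)
y <- 1 + 0.5 * x + 2 * h(x - 4) + rnorm(200, sd = 0.2)

# With a fixed knot, the piecewise fit is ordinary least squares
# on the mirrored hinge pair h(x - knot) and h(knot - x).
knot <- 4
fit_piecewise <- lm(y ~ h(x - knot) + h(knot - x))
fit_line <- lm(y ~ x)

rss_piecewise <- sum(residuals(fit_piecewise)^2)
rss_line <- sum(residuals(fit_line)^2)

print(coef(fit_piecewise))
print(c(piecewise = rss_piecewise, line = rss_line))
```

The slope to the right of the knot is the coefficient on `h(x - knot)`, and the slope to the left is minus the coefficient on `h(knot - x)`.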
Automated hyperparameter search is done using a grid search with 10-fold cross validation. The data is divided into 10 equal-sized folds; each fold in turn is used for validation while the remaining nine folds are used for training. The combination of hyperparameters that gives the lowest average error metric across the 10 folds is selected as the most appropriate.
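The fold mechanics described above can be sketched by hand in base R. This is an illustration on hypothetical data with a plain linear model standing in for the tuned model; it is not the routine used in this report.

```r
# Hypothetical data set.
set.seed(99)
n <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- 3 + 1.5 * dat$x + rnorm(n)

k <- 10
# Assign each row at random to one of k folds of equal size.
fold <- sample(rep(seq_len(k), length.out = n))

fold_rmse <- sapply(seq_len(k), function(i) {
  fit <- lm(y ~ x, data = dat[fold != i, ])         # train on the other folds
  pred <- predict(fit, newdata = dat[fold == i, ])  # predict the held-out fold
  sqrt(mean((dat$y[fold == i] - pred)^2))
})

# The cross-validated RMSE is the average over the k held-out folds.
cv_rmse <- mean(fold_rmse)
print(cv_rmse)
```

In a grid search, this loop is repeated for every hyperparameter combination and the combination with the lowest average error is kept.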
The cross-validation grid search covers interaction degrees 1 to 4, corresponding to the degree hyperparameter in the model, and the maximum number of terms to keep in the final pruned model (5, 10, 15, 20, or 25), corresponding to the nprune hyperparameter in the model.
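A sketch of how such a tuning grid can be built in base R. The values match the grid in the results table below; the commented call is an assumption about how a grid like this is commonly passed to the caret package with the earth method, and it is not confirmed as the exact call used here.

```r
# Tuning grid: 4 interaction degrees x 5 nprune values = 20 combinations.
grid <- expand.grid(degree = 1:4, nprune = seq(5, 25, by = 5))
print(nrow(grid))
print(head(grid))

# Assumed usage with the caret and earth packages (not run here):
# caret::train(x, y, method = "earth", tuneGrid = grid,
#              trControl = caret::trainControl(method = "cv", number = 10))
```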
The output below shows the RMSE, R-squared, and Mean Absolute Error for each hyperparameter combination, and which combination gives the lowest RMSE.
## Multivariate Adaptive Regression Spline
##
## 399 samples
## 19 predictor
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 360, 360, 359, 360, 359, 358, ...
## Resampling results across tuning parameters:
##
## degree nprune RMSE Rsquared MAE
## 1 5 0.3347450 0.9155202 0.2675747
## 1 10 0.2827323 0.9404094 0.2207624
## 1 15 0.2583996 0.9491761 0.2006350
## 1 20 0.2468224 0.9531259 0.1867214
## 1 25 0.2455320 0.9536011 0.1861822
## 2 5 0.4306936 0.8566350 0.3412134
## 2 10 0.2825427 0.9379481 0.2196733
## 2 15 0.2499818 0.9518572 0.1856844
## 2 20 0.2350463 0.9571385 0.1702083
## 2 25 0.2294475 0.9592087 0.1652835
## 3 5 0.4448370 0.8466339 0.3472776
## 3 10 0.2705762 0.9444388 0.2074108
## 3 15 0.2462877 0.9541068 0.1907308
## 3 20 0.2283323 0.9602594 0.1692556
## 3 25 0.2251070 0.9616913 0.1703719
## 4 5 0.4473331 0.8453710 0.3486220
## 4 10 0.2747770 0.9426524 0.2095868
## 4 15 0.2449439 0.9545137 0.1883902
## 4 20 0.2306693 0.9594563 0.1703174
## 4 25 0.2252363 0.9616512 0.1686914
##
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were nprune = 25 and degree = 3.
Below the model with the lowest RMSE from cross validation search is displayed.
## degree nprune RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 3 25 0.225107 0.9616913 0.1703719 0.03900839 0.01080429 0.01413105
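The selection step can be reproduced from the resampling table above. The sketch below copies the RMSE and MAE columns into a data frame (the name `cv_results` is illustrative) and picks the row with the smallest value of each.

```r
# RMSE and MAE columns copied from the resampling table above.
cv_results <- data.frame(
  degree = rep(1:4, each = 5),
  nprune = rep(seq(5, 25, by = 5), times = 4),
  RMSE = c(0.3347450, 0.2827323, 0.2583996, 0.2468224, 0.2455320,
           0.4306936, 0.2825427, 0.2499818, 0.2350463, 0.2294475,
           0.4448370, 0.2705762, 0.2462877, 0.2283323, 0.2251070,
           0.4473331, 0.2747770, 0.2449439, 0.2306693, 0.2252363),
  MAE = c(0.2675747, 0.2207624, 0.2006350, 0.1867214, 0.1861822,
          0.3412134, 0.2196733, 0.1856844, 0.1702083, 0.1652835,
          0.3472776, 0.2074108, 0.1907308, 0.1692556, 0.1703719,
          0.3486220, 0.2095868, 0.1883902, 0.1703174, 0.1686914)
)

best_rmse <- cv_results[which.min(cv_results$RMSE), ]
best_mae <- cv_results[which.min(cv_results$MAE), ]

print(best_rmse)
print(best_mae)
```

The RMSE gap between the selected row (degree 3, nprune 25) and the runner-up (degree 4, nprune 25) is about 0.0001, far smaller than the reported RMSESD of about 0.039, so the two are practically indistinguishable. Selecting by MAE instead would pick degree 2 with nprune 25.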
The model has selected 14 predictor variables with interaction terms up to degree 3. The final model has 25 terms, including the intercept. The cross-validated R-squared is 96.2%, so a very high percentage of the variance in the response is explained by the predictor variables in the model. The RMSE on the test set is 3.69.
For the simple linear regression model, the R-squared was 91%, so it explains 91% of the variance in the response, and the test-set RMSE was 29.4. With log-transformed data, the R-squared improved from 91% to 96% when moving from simple linear regression to the MARS model. The improvement was far larger without log transformation: there the R-squared was 98% for the MARS model and 87% for linear regression. Therefore, for the data without log transformation, there was a greater shift in the model fit statistics.
The MARS model definitely outperforms the simple linear regression model with a lower RMSE and higher Rsquare.
## Call: earth(x=data.frame[399,19], y=c(2.773,3.738,4...), keepxy=TRUE, degree=3,
## nprune=25)
##
## coefficients
## (Intercept) -3.7731
## h(Percent_of_Ottawa_residents_5_1_dose-1.57901) 12.4347
## h(4.51086-Percent_of_Ottawa_residents_12_1_dose) -0.1236
## h(Percent_of_Ottawa_residents_12_1_dose-4.51086) -23.2880
## h(Percent_of_Ottawa_residents_40_1_dose-4.5326) 36.5364
## h(4.62497-Percent_of_Ottawa_residents_70_1_dose) 30.8255
## h(Percent_of_Ottawa_residents_70_1_dose-4.62497) 49.0812
## h(1.09861-Percent_of_Ottawa_residents_80_1_dose) -0.6995
## h(2.77259-Percent_of_Ottawa_residents_5_2_dose) -0.3720
## h(Percent_of_Ottawa_residents_5_2_dose-2.77259) -0.7671
## h(47-Percent_of_Ottawa_residents_18_2_dose) 0.1753
## h(Percent_of_Ottawa_residents_60_2_dose-4.15888) 18.1515
## h(1.57901-Percent_of_Ottawa_residents_5_1_dose) * h(Percent_of_Ottawa_residents_18_1_dose-4.43082) 29.9169
## h(1.57901-Percent_of_Ottawa_residents_5_1_dose) * h(Percent_of_Ottawa_residents_50_2_dose-4.48864) -7.1867
## h(1.57901-Percent_of_Ottawa_residents_5_1_dose) * h(4.58497-Percent_of_Ottawa_residents_70_2_dose) 0.1294
## h(4.51086-Percent_of_Ottawa_residents_12_1_dose) * h(Percent_of_Ottawa_residents_80_2_dose-4.49981) -195.0861
## h(4.51086-Percent_of_Ottawa_residents_12_1_dose) * h(4.49981-Percent_of_Ottawa_residents_80_2_dose) 0.0678
## h(4.5326-Percent_of_Ottawa_residents_40_1_dose) * h(Percent_of_Ottawa_residents_50_1_dose-1.38629) 0.2603
## h(4.5326-Percent_of_Ottawa_residents_40_1_dose) * h(1.38629-Percent_of_Ottawa_residents_50_1_dose) 0.1204
## h(4.62497-Percent_of_Ottawa_residents_70_1_dose) * h(Percent_of_Ottawa_residents_12_2_dose-3.89182) -108.6073
## h(4.62497-Percent_of_Ottawa_residents_70_1_dose) * h(3.89182-Percent_of_Ottawa_residents_12_2_dose) -8.0409
## h(4.63473-Percent_of_Ottawa_residents_80_1_dose) * h(Percent_of_Ottawa_residents_18_2_dose-47) 3.5652
## h(4.40672-Percent_of_Ottawa_residents_12_2_dose) * h(Percent_of_Ottawa_residents_60_2_dose-4.15888) -2.9913
## h(Percent_of_Ottawa_residents_18_2_dose-47) * h(4.56435-Percent_of_Ottawa_residents_70_2_dose) -3.8100
## h(Percent_of_Ottawa_residents_80_1_dose-4.62497) * h(4.40672-Percent_of_Ottawa_residents_12_2_dose) * h(Percent_of_Ottawa_residents_60_2_dose-4.15888) 6893.2836
##
## Selected 25 of 38 terms, and 14 of 19 predictors (nprune=25)
## Termination condition: Reached nk 39
## Importance: Percent_of_Ottawa_residents_18_2_dose, ...
## Number of terms at each degree of interaction: 1 11 12 1
## GCV 0.03984378 RSS 11.4083 GRSq 0.9695456 RSq 0.9780357
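To illustrate how a term in the printed model is read, the sketch below evaluates one degree-2 interaction term, the product of two hinge functions with its coefficient taken from the output above. The input values passed to `term` are hypothetical and are assumed to be on the same (log) scale as the model inputs.

```r
# Hinge function used in the model printout: h(z) = max(0, z).
h <- function(z) pmax(0, z)

# Term from the output above:
# h(1.57901 - Percent_of_Ottawa_residents_5_1_dose) *
# h(Percent_of_Ottawa_residents_18_1_dose - 4.43082), coefficient 29.9169
term <- function(p5_1, p18_1) {
  29.9169 * h(1.57901 - p5_1) * h(p18_1 - 4.43082)
}

active <- term(1.0, 4.5)    # both hinges positive: non-zero contribution
inactive <- term(2.0, 4.5)  # first hinge is zero, so the term vanishes

print(c(active = active, inactive = inactive))
```

An interaction term therefore contributes only in the region where every one of its hinges is positive; the prediction is the intercept plus the sum of all such contributions.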
The cross-validation results show that the best model is the one with 3 degrees of interaction and 25 nprune terms.
Using the best model to predict response variable for the train and test set.
Root Mean Squared Error for the MARS model on the test set:
## [1] 3.693824
MAPE on the test set:
## [1] 0.1366874
Root Mean Squared Error for the MARS model on the train set:
## [1] 4.695084
MAPE on the train set:
## [1] 0.1273938
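The cross-validated RMSE in the tuning table (about 0.23) is on the log scale, while the RMSE values reported here appear to be on the original scale, which suggests the predictions were back-transformed with exp() before scoring. That reading is an assumption. A minimal sketch of the back-transformation on hypothetical data:

```r
# Hypothetical positive response that is linear on the log scale.
set.seed(5)
x <- runif(300, 1, 5)
counts <- exp(1 + 0.6 * x + rnorm(300, sd = 0.2))

fit <- lm(log(counts) ~ x)    # model fitted on the log scale
pred_log <- predict(fit)      # predictions on the log scale
pred_counts <- exp(pred_log)  # back-transform to the original units

# The same model gives very different RMSE values on the two scales.
rmse_log <- sqrt(mean((log(counts) - pred_log)^2))
rmse_counts <- sqrt(mean((counts - pred_counts)^2))

print(c(log_scale = rmse_log, original_scale = rmse_counts))
```

Error metrics are only comparable between models when they are computed on the same scale.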
Plots comparing actual test data and predicted test data by MARS model
Checking residuals in the MARS model and creating a plot
Taking the log transformation of the data has improved the correlation between the vaccine data and hospitalization. Simple linear regression also performs better on the log-transformed data than on the original data: the R-squared increased, and the RMSE and MAPE decreased.
However, piecewise linear regression using the MARS model still outperformed linear regression on the log-transformed data as well, giving lower error metrics and a better model fit.
The errors in the MARS model are much smaller than those obtained from Simple Linear Regression. There are a few outliers, but the Q-Q plot shows a close to normal distribution of errors.
MARS scans each predictor to identify splits that improve predictive accuracy, so non-informative features will not be chosen. Furthermore, highly correlated predictors do not impede predictive accuracy as much as they do with OLS models.